Abstract

Feature representations, both hand-designed and 
learned ones, are often hard to analyze and interpret, even 
when they are extracted from visual data. We propose a 
new approach to study image representations by inverting 
them with an up-convolutional neural network. We apply 
the method to shallow representations (HOG, SIFT, LBP), 
as well as to deep networks. For shallow representations 
our approach provides significantly better reconstructions 
than existing methods, revealing that there is surprisingly 
rich information contained in these features. Inverting a 
deep network trained on ImageNet provides several insights 
into the properties of the feature representation learned 
by the network. Most strikingly, the colors and the rough 
contours of an image can be reconstructed from activations 
in higher network layers and even from the predicted class 
probabilities. 


1. Introduction 

A feature representation useful for pattern recognition tasks is expected to concentrate on properties of the input image which are important for the task and ignore the irrelevant properties of the input image. For example, hand-designed descriptors such as HOG [3] or SIFT [17] explicitly discard the absolute brightness by only considering gradients, precise spatial information by binning the gradients, and precise values of the gradients by normalizing the histograms. Convolutional neural networks (CNNs) trained in a supervised manner [14, 13] are expected to discard information irrelevant for the task they are solving [28, 19, 22].

In this paper we propose a new approach to analyze which information is preserved by a feature representation and which information is discarded. We train neural networks to invert feature representations in the following sense. Given a feature vector, the network is trained to predict the expected pre-image, that is, the (weighted) average of all natural images which could have produced the given feature vector. The content of this expected pre-image shows image properties which can be confidently inferred from the feature vector. The amount of blur corresponds to the level of invariance of the feature representation. We obtain further insights into the structure of the feature space, as we apply the networks to perturbed feature vectors, to interpolations between two feature vectors, or to random feature vectors.

Figure 1: We train convolutional networks to reconstruct images from different feature representations. Top row: Input features. Bottom row: Reconstructed image. Reconstructions from HOG and SIFT are very realistic. Reconstructions from AlexNet preserve color and rough object positions even when reconstructing from higher layers.

We apply our inversion method to AlexNet [13], a convolutional network trained for classification on ImageNet, as well as to three widely used computer vision features: histogram of oriented gradients (HOG) [3, 7], scale invariant feature transform (SIFT) [17], and local binary patterns (LBP) [21]. The SIFT representation comes as a non-uniform, sparse set of oriented keypoints with their corresponding descriptors at various scales. This is an additional challenge for the inversion task. LBP features are not differentiable with respect to the input image. Thus, existing methods based on gradients of representations [19] could not be applied to them.

1.1. Related work

Our approach is related to a large body of work on inverting neural networks. These include works making use of backpropagation or sampling [15, 16, 18, 27, 9, 25] and, most similar to our approach, other neural networks [2]. However, only recent advances in neural network architectures allow us to invert a modern large convolutional network with another network.

Our approach is not to be confused with the DeconvNet [28], which propagates high level activations backward through a network to identify parts of the image responsible for the activation. In addition to the high-level feature activations, this reconstruction process uses extra information about maxima locations in intermediate max-pooling layers. This information has been shown to be crucial for the approach to work [22]. A visualization method similar to DeconvNet is by Springenberg et al. [22], yet it also makes use of intermediate layer activations.

Mahendran and Vedaldi [19] invert a differentiable image representation $\Phi$ using gradient descent. Given a feature vector $\Phi_0$, they seek an image $\mathbf{x}^*$ which minimizes a loss function: the squared Euclidean distance between $\Phi_0$ and $\Phi(\mathbf{x})$ plus a regularizer enforcing a natural image prior. This method is fundamentally different from our approach in that it optimizes the difference between the feature vectors, not the image reconstruction error. Additionally, it includes a hand-designed natural image prior, while in our case the network implicitly learns such a prior. Technically, it involves optimization at test time, which requires computing the gradient of the feature representation and makes it relatively slow (the authors report 6 s per image on a GPU). In contrast, the presented approach is only costly when training the inversion network. Reconstruction from a given feature vector just requires a single forward pass through the network, which takes roughly 5 ms per image on a GPU. The method of [19] requires gradients of the feature representation, therefore it could not be directly applied to non-differentiable representations such as LBP, or recordings from a real brain [20].

There has been research on inverting various traditional computer vision representations: HOG and dense SIFT [24], keypoint-based SIFT [26], Local Binary Descriptors [4], Bag-of-Visual-Words [11]. All these methods are either tailored for inverting a specific feature representation or restricted to shallow representations, while our method can be applied to any feature representation.

2. Method 

Denote by $(\mathbf{x}, \phi)$ random variables representing a natural image and its feature vector, and denote their joint probability distribution by $p(\mathbf{x}, \phi) = p(\mathbf{x})\, p(\phi \mid \mathbf{x})$. Here $p(\mathbf{x})$ is the distribution of natural images and $p(\phi \mid \mathbf{x})$ is the distribution of feature vectors given an image. As a special case, $\phi$ may be a deterministic function of $\mathbf{x}$. Ideally we would like to find $p(\mathbf{x} \mid \phi)$, but direct application of Bayes' theorem is not feasible. Therefore in this paper we resort to a point estimate $f(\phi)$ which minimizes the following mean squared error objective:

$$ \mathbb{E}_{\mathbf{x}, \phi} \, \| \mathbf{x} - f(\phi) \|_2^2 \qquad (1) $$

The minimizer of this loss is the conditional expectation:

$$ f(\phi_0) = \mathbb{E}_{\mathbf{x}} [\, \mathbf{x} \mid \phi = \phi_0 \,], \qquad (2) $$

that is, the expected pre-image.

Given a training set of images and their features $\{\mathbf{x}_i, \phi_i\}$, we learn the weights $w$ of an up-convolutional network $f(\phi, w)$ to minimize a Monte-Carlo estimate of the loss (1):

$$ \hat{w} = \arg\min_{w} \sum_i \| \mathbf{x}_i - f(\phi_i, w) \|_2^2. \qquad (3) $$

This means that simply training the network to predict images from their feature vectors results in estimating the expected pre-image.
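
As an informal illustration of why the squared-error objective leads to the expected pre-image, the sketch below uses a toy setting that is not taken from the paper: three hypothetical two-pixel "images" are assumed to share one feature vector with equal weight, and the loss of their pixel-wise mean is compared with the loss of predicting any single one of them. All names and values are illustrative assumptions.

```python
def squared_error(prediction, images):
    """Sum of squared Euclidean distances from one prediction to all images."""
    return sum(
        sum((p - x) ** 2 for p, x in zip(prediction, image))
        for image in images
    )


def mean_image(images):
    """Pixel-wise average of equally weighted images."""
    n = len(images)
    return [sum(pixels) / n for pixels in zip(*images)]


# Hypothetical pre-images that all map to the same feature vector.
pre_images = [[0.0, 1.0], [1.0, 1.0], [2.0, 4.0]]
expected_pre_image = mean_image(pre_images)

# Compare the loss of the mean with the loss of each individual candidate.
loss_of_mean = squared_error(expected_pre_image, pre_images)
losses_of_candidates = [squared_error(x, pre_images) for x in pre_images]
```

In this toy case the mean attains a lower loss than every individual pre-image, which is the behavior a network trained with a squared-error loss would be pushed towards.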

2.1. Feature representations to invert 

Shallow features. We invert three traditional computer vision feature representations: histogram of oriented gradients (HOG), scale invariant feature transform (SIFT), and local binary patterns (LBP). Each of these features was chosen for a specific reason. There has been work on inverting HOG, so we can compare to existing approaches. LBP is interesting because it is not differentiable, and hence gradient-based methods cannot invert it. SIFT is a keypoint-based representation, so the network has to stitch different keypoints into a single smooth image.

For all three methods we use implementations from the VLFeat library [23] with the default settings. More precisely, we use the HOG version from Felzenszwalb et al. [7] with cell size 8, the version of SIFT which is very similar to the original implementation of Lowe [17], and the LBP version similar to Ojala et al. [21] with cell size 16. Before extracting the features we convert images to grayscale. More details can be found in the supplementary material.

AlexNet. We also invert the representation of the AlexNet network [13] trained on ImageNet, available at the Caffe [10] website.¹ It consists of 5 convolutional layers and 3 fully connected layers, with rectified linear units (ReLUs) after each layer, and local contrast normalization or max-pooling after some of them. The exact architecture is shown in the supplementary material. In what follows, when we say 'output of the layer', we mean the output of the last processing step of this layer. For example, the output of the first convolutional layer CONV1 would be the result after ReLU, pooling and normalization, and the output of the first fully connected layer FC6 is after ReLU. FC8 denotes the last layer, before the softmax.

¹ More precisely, we used CaffeNet, which is almost identical to the original AlexNet.

2.2. Network architectures and training 

An up-convolutional layer, also often referred to as 'deconvolutional', is a combination of upsampling and convolution [6]. We upsample a feature map by a factor of 2 by replacing each value by a 2 x 2 block with the original value in the top left corner and all other entries equal to zero. The architecture of one of our up-convolutional networks is shown in Table 1. Architectures of other networks are shown in the supplementary material.
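
The upsampling step described above can be sketched in a few lines. This is a minimal illustration on a single-channel map stored as nested lists, not the Caffe layer used for the experiments; the function name is hypothetical.

```python
def upsample_zero_insert(feature_map, factor=2):
    """Upsample a 2D map: each value goes to the top-left corner of a
    factor x factor block and every other entry of the block is zero."""
    height, width = len(feature_map), len(feature_map[0])
    out = [[0.0] * (width * factor) for _ in range(height * factor)]
    for row in range(height):
        for col in range(width):
            out[row * factor][col * factor] = feature_map[row][col]
    return out


# A 2 x 2 map becomes a 4 x 4 map; a subsequent convolution fills in the zeros.
upsampled = upsample_zero_insert([[1.0, 2.0], [3.0, 4.0]])
```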

HOG and LBP. For an image of size $W \times H$, HOG and LBP features of an image form 3-dimensional arrays of sizes $\lceil W/8 \rceil \times \lceil H/8 \rceil \times 31$ and $\lceil W/16 \rceil \times \lceil H/16 \rceil \times 58$, respectively. We use similar CNN architectures for inverting both feature representations. The networks include a contracting part, which processes the input features through a series of convolutional layers with occasional stride of 2, resulting in a feature map 64 times smaller than the input image. Then the expanding part of the network again upsamples the feature map to the full image resolution by a series of up-convolutional layers. The contracting part allows the network to aggregate information over large regions of the input image. We found this is necessary to successfully estimate the absolute brightness.

Sparse SIFT. Running the SIFT detector and descriptor on an image gives a set of $N$ keypoints, where the $i$-th keypoint is described by its coordinates $(x_i, y_i)$, scale $s_i$, orientation $\alpha_i$, and a feature descriptor $\mathbf{f}_i$ of dimensionality $D$. In order to apply a convolutional network, we arrange the keypoints on a grid. We split the image into cells of size $d \times d$ (we used $d = 4$ in our experiments); this yields $\lceil W/d \rceil \times \lceil H/d \rceil$ cells. In the rare cases when there are several keypoints in a cell, we randomly select one. We then assign a vector to each of the cells: a zero vector to a cell without a keypoint and a vector $(\mathbf{f}_i, x_i \bmod d, y_i \bmod d, \sin \alpha_i, \cos \alpha_i, \log s_i)$ to a cell with a keypoint. This results in a feature map $F$ of size $\lceil W/d \rceil \times \lceil H/d \rceil \times (D + 5)$.

Then we apply a CNN to F, as described above.
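
The grid arrangement of sparse keypoints can be sketched as follows. This is a minimal illustration under several assumptions that the text does not fix: keypoints are given as tuples `(x, y, scale, angle, descriptor)`, angles are in radians, the grid is indexed as `[row][column]`, and the function name `keypoints_to_grid` is hypothetical.

```python
import math
import random


def keypoints_to_grid(keypoints, width, height, d=4, descriptor_dim=128, seed=0):
    """Arrange sparse keypoints on a ceil(H/d) x ceil(W/d) grid of (D + 5)-vectors.

    Each keypoint is a tuple (x, y, scale, angle, descriptor).
    Cells without a keypoint keep the zero vector.
    """
    rng = random.Random(seed)
    rows, cols = math.ceil(height / d), math.ceil(width / d)
    grid = [[[0.0] * (descriptor_dim + 5) for _ in range(cols)] for _ in range(rows)]

    # Group keypoints by the cell that contains them.
    cells = {}
    for kp in keypoints:
        x, y = kp[0], kp[1]
        cells.setdefault((int(y // d), int(x // d)), []).append(kp)

    for (row, col), candidates in cells.items():
        # If several keypoints share a cell, pick one at random.
        x, y, scale, angle, descriptor = rng.choice(candidates)
        grid[row][col] = list(descriptor) + [
            x % d, y % d, math.sin(angle), math.cos(angle), math.log(scale)
        ]
    return grid


# One toy keypoint in a 16 x 8 image gives a 2 x 4 grid of 133-dimensional cells.
toy_keypoint = (5.5, 2.0, 2.0, 0.0, [1.0] * 128)
feature_map = keypoints_to_grid([toy_keypoint], width=16, height=8)
```

With a 128-dimensional descriptor, each cell holds 133 values, which matches the SIFT input depth listed in the supplementary architecture table.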

AlexNet. To reconstruct from each layer of AlexNet we trained a separate network. We used two basic architectures: one for reconstructing from convolutional layers and one for reconstructing from fully connected layers. The network for reconstructing from fully connected layers contains three fully connected layers and five up-convolutional layers, as shown in Table 1. The network for reconstructing from convolutional layers consists of three convolutional and several up-convolutional layers (the exact number depends on the layer to reconstruct from). Filters in all (up-)convolutional layers have 5 x 5 spatial size. After each layer we apply leaky ReLU nonlinearity with slope 0.2, that is, $r(x) = x$ if $x \geq 0$ and $r(x) = 0.2 \cdot x$ if $x < 0$.

Layer     Input         InSize      K   S   OutSize
fc1       AlexNet-FC8   1000        -   -   4096
fc2       fc1           4096        -   -   4096
fc3       fc2           4096        -   -   4096
reshape   fc3           4096        -   -   4x4x256
upconv1   reshape       4x4x256     5   2   8x8x256
upconv2   upconv1       8x8x256     5   2   16x16x128
upconv3   upconv2       16x16x128   5   2   32x32x64
upconv4   upconv3       32x32x64    5   2   64x64x32
upconv5   upconv4       64x64x32    5   2   128x128x3

Table 1: Network for reconstructing from AlexNet FC8 features. K stands for kernel size, S for stride.
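
The leaky ReLU nonlinearity applied after each layer can be written as a one-line function. This is a minimal scalar sketch with the slope of 0.2 stated in the text; the function name is illustrative.

```python
def leaky_relu(x, slope=0.2):
    """Leaky ReLU: identity for non-negative inputs, slope * x otherwise."""
    return x if x >= 0 else slope * x


# Negative inputs are scaled down instead of being clipped to zero.
activations = [leaky_relu(v) for v in [-2.0, 0.0, 3.0]]
```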

Training details. We trained networks using a modified version of Caffe [10]. As training data we used the ImageNet [5] training set. In some cases we predicted downsampled images to speed up computations. We used the Adam [12] optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and mini-batch size 64. For most networks we found an initial learning rate $\lambda = 0.001$ to work well. We gradually decreased the learning rate towards the end of training. The duration of training depended on the network: from 15 epochs (passes through the dataset) for shallower networks to 60 epochs for deeper ones.

Quantitative evaluation. As a quantitative measure of performance we used the average normalized reconstruction error, that is the mean of $\| \mathbf{x}_i - f(\Phi(\mathbf{x}_i)) \|_2 / N$, where $\mathbf{x}_i$ is an example from the test set, $f$ is the function implemented by the inversion network and $N$ is a normalization coefficient equal to the average Euclidean distance between images in the test set. The test set we used for quantitative and qualitative evaluations is a subset of the ImageNet validation set.
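
The error measure can be sketched as below. The text does not say exactly how the average distance between test images is computed; this sketch assumes an average over all unordered pairs of distinct images, and it represents images as flat lists of pixel values. Function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two flattened images."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def normalized_reconstruction_error(images, reconstructions):
    """Mean of ||x_i - f(Phi(x_i))||_2 / N, where N is the average
    Euclidean distance between distinct images of the test set."""
    pair_distances = [euclidean(a, b) for a, b in combinations(images, 2)]
    norm = sum(pair_distances) / len(pair_distances)
    errors = [euclidean(x, r) / norm for x, r in zip(images, reconstructions)]
    return sum(errors) / len(errors)


# Toy test set of two 2-pixel images and their (hypothetical) reconstructions.
images = [[0.0, 0.0], [3.0, 4.0]]
reconstructions = [[0.0, 1.0], [3.0, 3.0]]
error = normalized_reconstruction_error(images, reconstructions)
```

Under this normalization, an error of 1 would mean that a reconstruction is on average as far from its target as a different test image would be.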

3. Experiments: shallow representations 

Figures 1 and 3 show reconstructions of several images from the ImageNet validation set. Normalized reconstruction error of different approaches is shown in Table 2. Clearly, our method significantly outperforms existing approaches. This is to be expected, since our method explicitly aims to minimize the reconstruction error.


Hoggles [24]   HOG^-1 [19]   HOG our   SIFT our   LBP our
0.61           0.63          0.24      0.28       0.38

Table 2: Normalized error of different methods when reconstructing from HOG.

Figure 2: Reconstructing an image from its HOG descriptors with different methods.


Colorization. As mentioned above, we compute the features based on grayscale images, but the task of the networks is to reconstruct the color images. The features do not contain any color information, so to predict colors the network has to analyze the content of the image and make use of a natural image prior it learned during training. It does successfully learn to do so, as can be seen in Figures 1 and 3. Quite often the colors are predicted correctly, especially for sky, sea, grass, trees. In other cases, the network cannot predict the color (for example, people in the top row of Figure 3) and leaves some areas gray. Occasionally the network predicts the wrong color, such as in the bottom row of Figure 3.

HOG. Figure 2 shows an example image, its HOG representation, the results of inversion with existing methods [24, 19] and with our approach. Most interestingly, the network is able to reconstruct the overall brightness of the image very well, for example the dark regions are reconstructed dark. This is quite surprising, since the HOG descriptors are normalized and should not contain information about absolute brightness.

Normalization is always performed with a smoothing 'epsilon', so one might imagine that some information about the brightness is present even in the normalized features. We checked that the network does not make use of this information: multiplying the input image by 10 or 0.1 hardly changes the reconstruction. Therefore, we hypothesize that the network reconstructs the overall brightness by 1) analyzing the distribution of the HOG features (if in a cell there is a similar amount of gradient in all directions, it is probably noise; if there is one dominating gradient, it must actually be in the image), 2) accumulating gradients over space: if there is much black-to-white gradient in one direction, then probably the brightness in that direction goes from dark to bright, and 3) using semantic information.

SIFT. Figure 4 shows an image, the detected SIFT keypoints and the resulting reconstruction. There are roughly 3000 keypoints detected in this image. Although made from a sparse set of keypoints, the reconstruction looks very natural, just a little blurry. To achieve such a clear reconstruction the network has to properly rotate and scale the descriptors and then stitch them together. Obviously it successfully learns to do this.

Figure 3: Inversion of shallow image representations. Note how in the first row the color of grass and trees is predicted correctly in all cases, although it is not contained in the features.

Figure 4: Reconstructing an image from SIFT descriptors with different methods: (a) an image, (b) SIFT keypoints, (c) reconstruction of [26], (d) our reconstruction.

Figure 5: Reconstructions from different layers of AlexNet.

Figure 6: Reconstructions from layers of AlexNet with our method (top), [19] (middle), and autoencoders (bottom).


For reference we also show a result of another existing method [26] for reconstructing images from sparse SIFT descriptors. The results are not directly comparable: while we use the SIFT detector providing circular keypoints, Weinzaepfel et al. [26] use the Harris affine keypoint detector which yields elliptic keypoints, and the number and the locations of the keypoints may be different from our case. However, the rough number of keypoints is the same, so a qualitative comparison is still valid.


4. Experiments: AlexNet 

We applied our inversion method to different layers of 
AlexNet and performed several additional experiments to 
better understand the feature representations. More results 
are shown in the supplementary material. 

4.1. Reconstructions from different layers 

Figure 5 shows reconstructions from various layers of AlexNet. When using features from convolutional layers, the reconstructed images look very similar to the input, but lose fine details as we progress to higher layers. There is an obvious drop in reconstruction quality when going from CONV5 to FC6. However, the reconstructions from higher convolutional layers and even fully connected layers preserve color and the approximate object location very well. Reconstructions from FC7 and FC8 still look similar to the input images, but blurry. This means that high level features are much less invariant to color and pose than one might expect: in principle fully connected layers need not preserve any information about colors and locations of objects in the input image. This is somewhat in contrast with the results of [19], as shown in Figure 6. While their reconstructions are sharper, the color and position are completely lost in reconstructions from higher layers.

Figure 7: Average normalized reconstruction error depending on the network layer.

For quantitative evaluation, before computing the error we up-sample reconstructions to input image size with bilinear interpolation. Error curves shown in Figure 7 support the conclusions made above. When reconstructing from FC6, the error is roughly twice as large as from CONV5. Even when reconstructing from FC8, the error is fairly low because the network manages to get the color and the rough placement of large objects in images right. For lower layers, the reconstruction error of [19] is still much higher than that of our method, even though visually the images look somewhat sharper. The reason is that in their reconstructions the color and the precise placement of small details do not perfectly match the input image, which results in a large overall error.

4.2. Autoencoder training 

Our inversion network can be interpreted as the decoder of the representation encoded by AlexNet. The difference to an autoencoder is that the encoder part stays fixed and only the decoder is optimized. For comparison we also trained autoencoders with the same architecture as our reconstruction nets, i.e., we also allowed the training to fine-tune the parameters of the AlexNet part. This provides an upper bound on the quality of reconstructions we might expect from the inversion networks (with fixed AlexNet).

As shown in Figure 7, autoencoder training yields much lower reconstruction errors when reconstructing from higher layers. Also the qualitative results in Figure 6 show much better reconstructions with autoencoders. Even from CONV5 features, the input image can be reconstructed almost perfectly. When reconstructing from fully connected layers, the autoencoder results get blurred, too, due to the compressed representation, but by far not as much as with the fixed AlexNet weights. The gap between the autoencoder training and the training with fixed AlexNet gives an estimate of the amount of image information lost due to the training objective of the AlexNet, which is not based on reconstruction quality.

Figure 8: The effect of color on classification and reconstruction from layer FC8. Left to right: input image, reconstruction from FC8, reconstruction from 5 largest activations in FC8, reconstruction from all FC8 activations except the 5 largest ones. Below each row the network prediction and its confidence are shown.

An interesting observation with autoencoders is that the reconstruction error is quite high even when reconstructing from CONV1 features, and the best reconstructions were actually obtained from CONV4. Our explanation is that the convolution with stride 4 and subsequent max-pooling in CONV1 loses much information about the image. To decrease the reconstruction error, it is beneficial for the network to slightly blur the image instead of guessing the details. When reconstructing from deeper layers, deeper networks can learn a better prior resulting in slightly sharper images and slightly lower reconstruction error. For even deeper layers, the representation gets too compressed and the error increases again. We observed (not shown in the paper) that without stride 4 in the first layer, the reconstruction error of autoencoders got much lower.

4.3. Case study: Colored apple 

We performed a simple experiment illustrating how the color information influences classification and how it is preserved in the high level features. We took an image of a red apple (Figure 8 top left) from Flickr and modified its hue to make it green or blue. Then we extracted AlexNet FC8 features of the resulting images. Recall that FC8 is the last layer of the network, so the FC8 features, after application of softmax, give the network's prediction of class probabilities. The largest activation, hence, corresponds to the network's prediction of the image class. To check how class-dependent the results of inversion are, we passed three versions of each feature vector through the inversion network: 1) just the vector itself, 2) all activations except the 5 largest ones set to zero, 3) the 5 largest activations set to zero.

Figure 9: Reconstructions from different layers of AlexNet with disturbed features.
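
The second and third versions of the feature vector can be produced with a small helper. This is a minimal sketch on a toy 8-dimensional vector standing in for the 1000 FC8 activations; the function name `split_by_top_k` is hypothetical.

```python
def split_by_top_k(features, k=5):
    """Return (top_only, rest_only): copies of the feature vector with
    everything except the k largest entries zeroed, and with the k
    largest entries zeroed."""
    order = sorted(range(len(features)), key=lambda i: features[i], reverse=True)
    top = set(order[:k])
    top_only = [v if i in top else 0.0 for i, v in enumerate(features)]
    rest_only = [0.0 if i in top else v for i, v in enumerate(features)]
    return top_only, rest_only


# Toy stand-in for a vector of 1000 FC8 activations.
fc8 = [0.1, 2.0, -1.0, 0.5, 3.0, 0.2, 1.5, 0.7]
top_only, rest_only = split_by_top_k(fc8)
```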

The experiment leads to several conclusions. First, color clearly can be very important for classification, so the feature representation of the network has to be sensitive to it, at least in some cases. Second, the color of the image can be precisely reconstructed even from FC8 or, equivalently, from the predicted class probabilities. Third, the reconstruction quality does not depend much on the top predictions of the network but rather on the small probabilities of all other classes. This is consistent with the 'dark knowledge' idea of [8]: small probabilities of non-predicted classes carry more information than the prediction itself. More examples of this are shown in the supplementary material.

4.4. Robustness of the feature representation 

We have shown that high level feature maps preserve rich information about the image. How is this information represented in the feature vector? It is difficult to answer this question precisely, but we can gain some insight by perturbing the feature representations in certain ways and observing images reconstructed from these perturbed features. If perturbing the features in a certain way does not change the reconstruction much, then the perturbed property is not important. For example, if setting a non-zero feature to zero does not change the reconstruction, then this feature does not carry information useful for the reconstruction.

We applied binarization and dropout. To binarize the feature vector, we kept the signs of all entries and set their absolute values to a fixed number, selected such that the Euclidean norm of the vector remained unchanged (we tried several other strategies, and this one led to the best result). For all layers except FC8, feature vector entries are non-negative, hence, binarization just sets all non-zero entries to a fixed positive value. To perform dropout, we randomly set 50% of the feature vector entries to zero and then normalize the vector to keep its Euclidean norm unchanged (again, we found this normalization to work best). Qualitative results of these perturbations of features in different layers of AlexNet are shown in Figure 9. Quantitative results are shown in Figure 7. Surprisingly, dropout leads to a larger decrease in reconstruction accuracy than binarization, even in the layers where it had been applied during training. In layers FC7 and especially FC6, binarization hardly changes the reconstruction quality at all. Although it is known that binarized ConvNet features perform well in classification [1], it comes as a surprise that for reconstructing the input image the exact values of the features are not important. In FC6 virtually all information about the image is contained in the binary code given by the pattern of non-zero activations. Figures 7 and 9 show that this binary code only emerges when training with the classification objective and dropout, while autoencoders are very sensitive to perturbations in the features.
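
The two perturbations can be sketched as follows. This is a minimal illustration on plain lists, assuming that zero entries stay zero under binarization and that dropout removes exactly half of the entries; helper names and the fixed random seed are illustrative choices.

```python
import math
import random


def l2_norm(vector):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in vector))


def binarize(vector):
    """Keep signs, set all non-zero magnitudes to one value, preserve the L2 norm."""
    nonzero = sum(1 for v in vector if v != 0)
    if nonzero == 0:
        return list(vector)
    magnitude = l2_norm(vector) / math.sqrt(nonzero)
    return [math.copysign(magnitude, v) if v != 0 else 0.0 for v in vector]


def dropout(vector, rate=0.5, seed=0):
    """Zero a random fraction of entries, then rescale to the original L2 norm."""
    rng = random.Random(seed)
    dropped = set(rng.sample(range(len(vector)), int(rate * len(vector))))
    kept = [0.0 if i in dropped else v for i, v in enumerate(vector)]
    kept_norm = l2_norm(kept)
    if kept_norm == 0:
        return kept
    scale = l2_norm(vector) / kept_norm
    return [v * scale for v in kept]


features = [3.0, 0.0, -4.0, 0.0]
binary = binarize(features)
perturbed = dropout([1.0, 2.0, 2.0, 4.0])
```

Both functions leave the Euclidean norm of the input unchanged, so any change in the reconstruction comes from the altered pattern of values rather than from a change in overall scale.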

To test the robustness of this binary code, we applied binarization and dropout together. We tried dropping out either 50% random activations or the 50% smallest non-zero activations, and then binarizing. Dropping out the 50% smallest activations reduces the error much less than dropping out 50% random activations and is even better than not applying any dropout for most layers. However, layers FC6 and FC7 are the most interesting ones: here dropping out 50% random activations decreases the performance substantially, while dropping out the 50% smallest activations only results in a small decrease. Possibly the exact values of the features in FC6 and FC7 do not affect the reconstruction much, but they estimate the importance of different features.

4.5. Interpolation and random feature vectors 

Another way to analyze the feature representation is by traversing the feature manifold and by observing the corresponding images generated by the reconstruction networks. We have seen the reconstructions from feature vectors of actual images, but what if a feature vector was not generated from a natural image? In Figure 10 we show reconstructions obtained with our networks when interpolating between feature vectors of two images. It is interesting to see that interpolating CONV5 features leads to a simple overlay of images, but the behavior of interpolations when reconstructing from FC6 is very different: images smoothly morph into each other. More examples, together with the results for autoencoders, are shown in the supplementary material.

Figure 10: Interpolation between the features of two images.
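
Interpolation between two feature vectors can be sketched as below. The text does not state the interpolation scheme; this sketch assumes plain linear interpolation with evenly spaced steps (at least two), and each intermediate vector would then be passed through the inversion network.

```python
def interpolate(features_a, features_b, steps=5):
    """Linearly interpolate between two feature vectors (endpoints included)."""
    path = []
    for k in range(steps):
        t = k / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(features_a, features_b)])
    return path


# Three evenly spaced points between two toy 2-dimensional feature vectors.
path = interpolate([0.0, 2.0], [4.0, 0.0], steps=3)
```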

Another analysis method is by sampling feature vectors randomly. Our networks were trained to reconstruct images given their feature representations, but the distribution of the feature vectors is unknown. Hence, there is no simple principled way to sample from our model. However, by assuming independence of the features (a very strong and wrong assumption!), we can approximate the distribution of each dimension of the feature vector separately. To this end we simply computed a histogram of each feature over a set of 4096 images and sampled from those. We ensured that the sparsity of the random samples is the same as that of the actual feature vectors. This procedure led to low contrast images, perhaps because by independently sampling each dimension we did not introduce interactions between the features. Multiplying the feature vectors by a constant factor $\alpha = 2$ increases the contrast without affecting other properties of the generated images.
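
The per-dimension sampling can be sketched as follows. This is only a rough illustration: the number of bins, the uniform draw within a bin, and all function names are assumptions, and the sparsity-matching step mentioned in the text is omitted.

```python
import random


def fit_histogram(values, bins=10):
    """Histogram of one feature dimension: (bin edges, counts)."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # fall back to unit width for constant features
    counts = [0] * bins
    for v in values:
        counts[min(int((v - low) / width), bins - 1)] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts


def sample_feature_vector(histograms, scale=1.0, rng=random):
    """Draw every dimension independently from its own histogram."""
    sample = []
    for edges, counts in histograms:
        b = rng.choices(range(len(counts)), weights=counts)[0]
        sample.append(scale * rng.uniform(edges[b], edges[b + 1]))
    return sample


# Toy stand-in for features of 4096 images: 3 images, 2 feature dimensions.
observed = [[0.0, 1.0], [0.0, 3.0], [2.0, 5.0]]
histograms = [fit_histogram(list(column), bins=4) for column in zip(*observed)]
random_features = sample_feature_vector(histograms, scale=2.0, rng=random.Random(0))
```

The `scale` argument plays the role of the constant factor applied to the sampled vectors before they are fed to the inversion network.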

Random samples obtained this way from four top layers of AlexNet are shown in Figure 11. No pre-selection was performed. While samples from CONV5 look much like abstract art, the samples from fully connected layers are much more realistic. This shows that the networks learn a natural image prior that allows them to produce somewhat realistic-looking images from random feature vectors. We found that a much simpler sampling procedure of fitting a single shifted truncated Gaussian to all feature dimensions produces qualitatively very similar images. These are shown in the supplementary material together with images generated from autoencoders, which look much less like natural images.

Figure 11: Images generated from random feature vectors of top layers of AlexNet.

5. Conclusions 

We have proposed to invert image representations with up-convolutional networks and have shown that this yields more or less accurate reconstructions of the original images, depending on the level of invariance of the feature representation. The networks implicitly learn natural image priors which allow the retrieval of information that is obviously lost in the feature representation, such as color or brightness in HOG or SIFT. The method is very fast at test time and does not require the gradient of the feature representation to be inverted. Therefore, it can be applied to virtually any image representation.

Application of our method to the representations learned by the AlexNet convolutional network leads to several conclusions: 1) Features from all layers of the network, including the final FC8 layer, preserve the precise colors and the rough position of objects in the image; 2) In higher layers, almost all information about the input image is contained in the pattern of non-zero activations, not their precise values; 3) In the layer FC8, most information about the input image is contained in small probabilities of those classes that are not in the top-5 network predictions.

Supplementary material

Network architectures. Table 3 shows the architecture of AlexNet. Tables 4-8 show the architectures of networks we used for inverting different features. After each fully connected and convolutional layer there is always a leaky ReLU nonlinearity. Networks for inverting HOG and LBP have two streams. Stream A compresses the input features spatially and accumulates information over large regions. We found this crucial to get good estimates of the overall brightness of the image. Stream B does not compress spatially and hence can better preserve fine local details. At one point the outputs of the two streams are concatenated and processed jointly, denoted by “J”. K stands for kernel size, S for stride.

Shallow features details. As mentioned in the paper, for
all three methods we use implementations from the VLFeat
library [23] with the default settings. We use the
Felzenszwalb et al. version of HOG with cell size 8. For SIFT
we used 3 levels per octave; the first octave was 0
(corresponding to full resolution), and the number of octaves was set
automatically, effectively searching for keypoints of all
possible sizes.


Layer     Input                InSize      K  S  OutSize
convA1    HOG                  32x32x31    5  2  16x16x256
convA2    convA1               16x16x256   5  2  8x8x512
convA3    convA2               8x8x512     3  2  4x4x1024
upconvA1  convA3               4x4x1024    4  2  8x8x512
upconvA2  upconvA1             8x8x512     4  2  16x16x256
upconvA3  upconvA2             16x16x256   4  2  32x32x128
convB1    HOG                  32x32x31    5  1  32x32x128
convB2    convB1               32x32x128   3  1  32x32x128
convJ1    {upconvA3, convB2}   32x32x256   3  1  32x32x256
convJ2    convJ1               32x32x256   3  1  32x32x128
upconvJ4  convJ2               32x32x128   4  2  64x64x64
upconvJ5  upconvJ4             64x64x64    4  2  128x128x32
upconvJ6  upconvJ5             128x128x32  4  2  256x256x3


Table 4: Network for reconstructing from HOG features. 
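
The spatial sizes in Table 4 can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that a convolution with stride S maps a side length n to ceil(n / S) (i.e. "same"-style padding) and that an up-convolution with stride S maps n to n * S. These padding conventions are our assumption and are not stated in the table; the function names are hypothetical.

```python
import math


def conv_size(size: int, stride: int) -> int:
    """Side length after a convolution, assuming 'same'-style padding."""
    return math.ceil(size / stride)


def upconv_size(size: int, stride: int) -> int:
    """Side length after an up-convolution that upsamples by `stride`."""
    return size * stride


def trace_hog_network(in_size: int = 32) -> dict:
    """Follow the spatial sizes of the two-stream HOG network of Table 4."""
    # Stream A: three stride-2 convolutions compress 32 -> 16 -> 8 -> 4 ...
    a = in_size
    for stride in (2, 2, 2):
        a = conv_size(a, stride)
    bottleneck = a
    # ... and three stride-2 up-convolutions restore 4 -> 8 -> 16 -> 32.
    for stride in (2, 2, 2):
        a = upconv_size(a, stride)

    # Stream B: two stride-1 convolutions keep the resolution.
    b = in_size
    for stride in (1, 1):
        b = conv_size(b, stride)

    if a != b:
        raise ValueError("streams must agree in size before concatenation")

    # Concatenation along channels: upconvA3 (128) + convB2 (128).
    joint_channels = 128 + 128

    # Joint part: two stride-1 convolutions, then three stride-2 up-convolutions.
    j = a
    for stride in (1, 1):
        j = conv_size(j, stride)
    for stride in (2, 2, 2):
        j = upconv_size(j, stride)

    return {
        "bottleneck": bottleneck,
        "concat": (a, a, joint_channels),
        "output": (j, j, 3),
    }


if __name__ == "__main__":
    print(trace_hog_network())
```

Under these assumptions the trace reproduces the 4x4 bottleneck of stream A, the 32x32x256 input of convJ1 and the 256x256x3 output listed in the table.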


Layer     Input     InSize      K  S  OutSize
conv1     SIFT      64x64x133   5  2  32x32x256
conv2     conv1     32x32x256   3  2  16x16x512
conv3     conv2     16x16x512   3  2  8x8x1024
conv4     conv3     8x8x1024    3  2  4x4x2048
conv5     conv4     4x4x2048    3  1  4x4x2048
conv6     conv5     4x4x2048    3  1  4x4x1024
upconv1   conv6     4x4x1024    4  2  8x8x512
upconv2   upconv1   8x8x512     4  2  16x16x256
upconv3   upconv2   16x16x256   4  2  32x32x128
upconv4   upconv3   32x32x128   4  2  64x64x64
upconv5   upconv4   64x64x64    4  2  128x128x32
upconv6   upconv5   128x128x32  4  2  256x256x3


Table 5: Network for reconstructing from SIFT features. 


Layer     Input                InSize      K  S  OutSize
convA1    LBP                  16x16x58    5  2  8x8x256
convA2    convA1               8x8x256     5  2  4x4x512
convA3    convA2               4x4x512     3  1  4x4x1024
upconvA1  convA3               4x4x1024    4  2  8x8x512
upconvA2  upconvA1             8x8x512     4  2  16x16x256
convB1    LBP                  16x16x58    5  1  16x16x128
convB2    convB1               16x16x128   3  1  16x16x128
convJ1    {upconvA2, convB2}   16x16x384   3  1  16x16x256
convJ2    convJ1               16x16x256   3  1  16x16x128
upconvJ3  convJ2               16x16x128   4  2  32x32x128
upconvJ4  upconvJ3             32x32x128   4  2  64x64x64
upconvJ5  upconvJ4             64x64x64    4  2  128x128x32
upconvJ6  upconvJ5             128x128x32  4  2  256x256x3


Table 6: Network for reconstructing from LBP features. 


Layer     Input           InSize     K  S  OutSize
conv1     AlexNet-CONV5   6x6x256    3  1  6x6x256
conv2     conv1           6x6x256    3  1  6x6x256
conv3     conv2           6x6x256    3  1  6x6x256
upconv1   conv3           6x6x256    5  2  12x12x256
upconv2   upconv1         12x12x256  5  2  24x24x128
upconv3   upconv2         24x24x128  5  2  48x48x64
upconv4   upconv3         48x48x64   5  2  96x96x32
upconv5   upconv4         96x96x32   5  2  192x192x3


Table 7: Network for reconstructing from AlexNet CONV5
features.

The LBP version we used works with 3x3 pixel
neighborhoods. Each of the 8 non-central bits is equal to one if
the corresponding pixel is brighter than the central one. All
possible 256 patterns are quantized into 58 patterns. These
include 56 patterns with exactly one transition from 0 to 1
when going around the central pixel, plus one quantized
pattern comprising two uniform patterns, plus one quantized
pattern containing all other patterns. The quantized LBP
patterns are then grouped into local histograms over cells of
16 x 16 pixels.
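
The quantization described above can be reproduced by counting circular 0-to-1 transitions in each 8-bit code. The following sketch is only illustrative: the neighbor ordering and the numbering of the bins are our assumptions and need not match the VLFeat implementation, and the function names are hypothetical.

```python
# Offsets (row, column) of the 8 neighbors, visited in circular order
# around the central pixel. The starting point and direction are assumptions.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(patch):
    """8-bit LBP code of a 3x3 patch: a bit is one if the neighbor is brighter."""
    center = patch[1][1]
    code = 0
    for bit, (dy, dx) in enumerate(NEIGHBORS):
        if patch[1 + dy][1 + dx] > center:
            code |= 1 << bit
    return code


def zero_to_one_transitions(code, bits=8):
    """Number of 0 -> 1 transitions when going once around the circle."""
    count = 0
    for i in range(bits):
        current = (code >> i) & 1
        following = (code >> ((i + 1) % bits)) & 1
        if current == 0 and following == 1:
            count += 1
    return count


def build_lbp_quantization(bits=8):
    """Map each of the 2**bits raw codes to one of the quantized patterns."""
    table = {}
    next_bin = 0
    # Patterns with exactly one 0 -> 1 transition each get their own bin.
    for code in range(1 << bits):
        if zero_to_one_transitions(code, bits) == 1:
            table[code] = next_bin
            next_bin += 1
    constant_bin = next_bin   # all zeros and all ones share one bin
    other_bin = next_bin + 1  # every remaining pattern shares one bin
    all_ones = (1 << bits) - 1
    for code in range(1 << bits):
        if code not in table:
            table[code] = constant_bin if code in (0, all_ones) else other_bin
    return table


if __name__ == "__main__":
    table = build_lbp_quantization()
    print(len(set(table.values())), "quantized patterns")
```

Counting the bins confirms the numbers in the text: 56 single-transition patterns, one bin for the two constant patterns and one bin for the remaining 198 codes, 58 in total.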

Experiments: shallow representations. Figure 12 shows
several images and their reconstructions from HOG, SIFT
and LBP. HOG allows for the best reconstruction, SIFT is
slightly worse, and LBP slightly worse still. Colors are often


Layer     Input         InSize     K  S  OutSize
fc1       AlexNet-FC8   1000       -  -  4096
fc2       fc1           4096       -  -  4096
fc3       fc2           4096       -  -  4096
reshape   fc3           4096       -  -  4x4x256
upconv1   reshape       4x4x256    5  2  8x8x256
upconv2   upconv1       8x8x256    5  2  16x16x128
upconv3   upconv2       16x16x128  5  2  32x32x64
upconv4   upconv3       32x32x64   5  2  64x64x32
upconv5   upconv4       64x64x32   5  2  128x128x3


Table 8: Network for reconstructing from AlexNet FC8
features.
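
The reshape step in Table 8 turns the 4096-dimensional output of fc3 into a 4x4x256 feature map, which five stride-2 up-convolutions then enlarge to 128x128. A minimal NumPy sketch of this step follows. The leaky ReLU slope of 0.2 and the assumption that every up-convolution exactly doubles the side length are ours; neither is given in the table, and the function names are hypothetical.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    """Leaky ReLU; the slope value is an assumption made for this sketch."""
    return np.where(x > 0, x, slope * x)


def reshape_and_trace(fc3_out, n_upconv=5, stride=2):
    """Reshape a 4096-vector to 4x4x256 and list the side lengths that follow."""
    grid = np.asarray(fc3_out).reshape(4, 4, 256)
    sizes = [grid.shape[0]]
    for _ in range(n_upconv):
        # Each up-convolution is assumed to multiply the side length by `stride`.
        sizes.append(sizes[-1] * stride)
    return grid, sizes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for the fc3 activations; real values would come from the network.
    fc3_out = leaky_relu(rng.normal(size=4096))
    grid, sizes = reshape_and_trace(fc3_out)
    print(grid.shape, sizes)
```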


































Layer  Processing steps  Out size  Out channels
CONV1  conv1, relu1      55        96
       mpool1, norm1     27        96
CONV2  conv2, relu2      27        256
       mpool2, norm2     13        256
CONV3  conv3, relu3      13        384
CONV4  conv4, relu4      13        384
CONV5  conv5, relu5      13        256
       mpool5            6         256
FC6    fc6, relu6        1         4096
       drop6             1         4096
FC7    fc7, relu7        1         4096
       drop7             1         4096
FC8    fc8               1         1000


Table 3: Summary of the AlexNet network. Input image size is 227 x 227. 
Figure 12: Inversion of shallow image representations.


reconstructed correctly, but sometimes are wrong, for
example in the last row. Interestingly, all networks typically
agree on estimated colors.

Experiments: AlexNet. We show here several additional
figures similar to ones from the main paper.
Reconstructions from different layers of AlexNet are shown in
Figure 13. Figure 14 shows results illustrating the 'dark
knowledge' hypothesis, similar to Figure 8 from the main paper.
We reconstruct from all FC8 features, as well as from only
the 5 largest ones or all except the 5 largest ones. It turns out
that the top 5 activations are not very important.
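
The two modified FC8 inputs used in this experiment can be sketched as follows. We assume here that the discarded activations are replaced by zeros; the text does not specify the replacement value, so this is an assumption, and the function name is hypothetical.

```python
import numpy as np


def split_top_k(features, k=5):
    """Return (top-k only, all except top-k) versions of a feature vector.

    Discarded entries are set to zero, which is an assumption of this sketch.
    """
    features = np.asarray(features, dtype=float)
    top_idx = np.argsort(features)[-k:]  # indices of the k largest activations

    top_only = np.zeros_like(features)
    top_only[top_idx] = features[top_idx]

    rest_only = features.copy()
    rest_only[top_idx] = 0.0
    return top_only, rest_only


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fc8 = rng.random(1000)  # stand-in for a 1000-dimensional FC8 vector
    top_only, rest_only = split_top_k(fc8)
    print(np.count_nonzero(top_only), np.count_nonzero(rest_only))
```

Each of the two vectors would then be passed to the same inversion network as the unmodified features.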


Figure 15 shows images generated by activating single
neurons in different layers and setting all other neurons to
zero. Particularly interpretable are images generated this
way from FC8. Every FC8 neuron corresponds to a class.
Hence the image generated from the activation of, say, the
“apple” neuron could be expected to be a stereotypical apple.
What we observe looks rather like it might be the average of
all images of the class. For some classes the reconstructions
are somewhat interpretable, for others not so much.

A qualitative comparison of reconstructions with our
method to the reconstructions of [19] and the results with
AlexNet-based autoencoders is given in Figure 16.

Reconstructions from feature vectors obtained by
interpolating between feature vectors of two images are shown in
Figure 17, both for fixed AlexNet and autoencoder training.
More examples of such interpolations with fixed AlexNet
are shown in Figure 18.
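
A minimal sketch of how such intermediate feature vectors can be produced is given below. It assumes plain linear interpolation with equally spaced weights; the number of steps and the function name are hypothetical choices made for illustration.

```python
import numpy as np


def interpolate_features(feat_a, feat_b, n_steps=8):
    """Linearly interpolate between two feature vectors.

    Returns an array of shape (n_steps, dim); the first row equals feat_a
    and the last row equals feat_b.
    """
    feat_a = np.asarray(feat_a, dtype=float)
    feat_b = np.asarray(feat_b, dtype=float)
    alphas = np.linspace(0.0, 1.0, n_steps)
    return np.stack([(1.0 - a) * feat_a + a * feat_b for a in alphas])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for the features of two images, e.g. 4096-dimensional vectors.
    a, b = rng.random(4096), rng.random(4096)
    path = interpolate_features(a, b)
    print(path.shape)
```

Each row of the result would be fed to the inversion network to obtain one image of the interpolation sequence.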

As described in section 5.5 of the main paper, we tried
two different distributions for sampling random feature
activations: a histogram-based one and a truncated Gaussian.
Figure 19 shows the results with fixed AlexNet network and
truncated Gaussian distribution. Figures 20 and 21 show
images generated with autoencoder-trained networks. Note
that images generated from autoencoders look much less
realistic than images generated with a network with fixed
AlexNet weights. This indicates that reconstructing from
AlexNet features requires a strong natural image prior.
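
To make the histogram-based variant concrete, the sketch below draws random feature vectors from per-dimension histograms. It treats the feature dimensions as independent and samples uniformly within each histogram bin; both choices, the bin count and the function name are assumptions of this illustration rather than a description of the exact procedure.

```python
import numpy as np


def sample_from_histograms(features, n_samples, n_bins=32, seed=0):
    """Sample feature vectors dimension by dimension from empirical histograms.

    features: array of shape (n_examples, n_dims) with observed activations.
    Returns an array of shape (n_samples, n_dims).
    """
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    n_dims = features.shape[1]
    samples = np.empty((n_samples, n_dims))
    for d in range(n_dims):
        counts, edges = np.histogram(features[:, d], bins=n_bins)
        probs = counts / counts.sum()
        # Pick a bin according to its empirical frequency ...
        bins = rng.choice(n_bins, size=n_samples, p=probs)
        # ... then draw a value uniformly inside that bin.
        samples[:, d] = rng.uniform(edges[bins], edges[bins + 1])
    return samples


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Stand-in for non-negative activations of a fully connected layer.
    observed = np.maximum(rng.normal(size=(500, 16)), 0.0)
    random_features = sample_from_histograms(observed, n_samples=4)
    print(random_features.shape)
```

The sampled vectors would then be passed through the inversion network to generate images.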




































Figure 13: Reconstructions from different layers of AlexNet. 
Figure 14: Left to right: input image,
reconstruction from FC8, reconstruction
from 5 largest activations in FC8,
reconstruction from all FC8 activations except
5 largest ones.


Figure 15: Reconstructions from single neuron activations in the fully
connected layers of AlexNet. The FC8 neurons correspond to classes, left to
right: kite, convertible, desktop computer, school bus, street sign, soup
bowl, bell pepper, soccer ball.
























Figure 17: Interpolation between the features of two images. Left: AlexNet weights fixed, right: autoencoder. 
























Figure 19: Images generated from random feature vectors of top layers of AlexNet with the simpler truncated Gaussian 
distribution (see section 5.5 of the main paper). 



Figure 20: Images generated from random feature vectors of top layers of AlexNet-based autoencoders with the histogram- 
based distribution (see section 5.5 of the main paper). 













(Figure panel labels: FC7, FC8)



Figure 21: Images generated from random feature vectors of top layers of AlexNet-based autoencoders with the simpler 
truncated Gaussian distribution (see section 5.5 of the main paper). 



